feat(graphrag): tenant-partition in-memory stores + query ctx (RAN-37)#27
Conversation
Introduces a tenant dimension to every GraphRAG in-memory store and the query surface, so every ingest and query lands in the calling tenant's slice of the graph. Architecture - New tenantStores composite bundles one tenant's ServiceStore, TraceStore, SignalStore, AnomalyStore. GraphRAG drops the four direct-store fields and instead holds tenants map[string]*tenantStores guarded by its own RWMutex; slices are lazy-created via storesFor(ctx) / storesForTenant(tenant). The default tenant is seeded at New() so background loops have a baseline to iterate. - spanEvent / logEvent / metricEvent each carry Tenant string; OnSpanIngested / OnLogIngested / OnMetricIngested derive it from storage.Span / storage.Log / tsdb.RawMetric TenantID fields. Callback signatures are intentionally unchanged — tsdb.RawMetric already carries TenantID from ingest/otlp.go, so no second argument or struct change is needed for the metric path. - Every public query method takes ctx as its first argument and routes through storesFor(ctx): ErrorChain, ImpactAnalysis, RootCauseAnalysis, DependencyChain, CorrelatedSignals, ShortestPath, AnomalyTimeline, ServiceMap, SimilarErrors. Two narrow helpers (AllServiceEdges, AnomaliesForService) let handlers read store data without reaching into the composite directly. - Background loops (refresh, snapshot, anomaly) iterate a stable snapshot of tenants on each tick, wrap ctx with storage.WithTenantContext(ctx, tenant), and act on each slice under its own per-store lock. rebuildAllTenantsFromDB enumerates tenants from the spans table so historical data from tenants without live callbacks is still recovered. - The Drain log-template miner stays shared across tenants — templates describe log shape, not content — while per-tenant SignalStore maps keep cluster nodes isolated. Persisted tables (investigations, graph_snapshots, drain_templates) intentionally unchanged; Subtask B (RAN-38) owns the tenant_id column migration. PersistInvestigation now accepts tenant as its first arg so the column can be wired in B without a signature break. External callers updated - internal/mcp/tools.go: all GraphRAG-backed tools take ctx and thread it through the query call; AnomalyStore direct access replaced with AnomaliesForService. - internal/api/graph_handler.go: handleGetSystemGraph uses r.Context() throughout, the 10s response cache is tenant-keyed so cross-tenant cache hits are impossible, and ServiceStore direct access replaced with AllServiceEdges. Tests - builder_test + investigation_cooldown_test updated to new signatures. - New tenant_isolation_test covers the core RAN-37 invariant: two tenants ingesting overlapping data see only their own services / traces, and empty-context lookups collapse to DefaultTenantID. - go build ./... && go vet ./... && go test ./... all green; go test -race ./internal/graphrag/... clean. Scope - Out of scope: persisted-table tenant columns (Subtask B, RAN-38), MCP-level tenant-isolation integration test (Subtask C, RAN-39 — the merge gate for RAN-19). internal/graph/ legacy package untouched. Refs RAN-19 RAN-21 Co-Authored-By: Paperclip <noreply@paperclip.ing>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 901a4aee9a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| Table("spans"). | ||
| Select("span_id, parent_span_id, service_name, operation_name, duration, trace_id, status, start_time"). | ||
| Where("start_time > ?", since). | ||
| Where("start_time > ? AND tenant_id = ?", since, tenant). |
There was a problem hiding this comment.
Include empty tenant_id rows in default rebuild query
The per-tenant rebuild now filters with tenant_id = ?, which causes the default-tenant pass to miss historical spans whose tenant_id is NULL/empty (common for pre-backfill data). That means restart recovery can produce an empty in-memory graph even though recent spans exist. Since rebuildAllTenantsFromDB already normalizes empty tenant IDs to DefaultTenantID, this query should also include empty/NULL tenant rows when rebuilding the default tenant.
Useful? React with 👍 / 👎.
| // anomaly loop) can re-enter ImpactAnalysis on the correct tenant slice. The | ||
| // `investigations` table itself does not yet carry a tenant_id column — | ||
| // Subtask B (RAN-38) owns that migration. | ||
| func (g *GraphRAG) PersistInvestigation(tenant, triggerService string, chains []ErrorChainResult, anomalies []*AnomalyNode) { |
There was a problem hiding this comment.
Scope investigation cooldown key by tenant
After this change, PersistInvestigation is invoked per tenant, but the cooldown dedup key still excludes tenant identity and only uses trigger/root fields. In a multi-tenant setup, two tenants with the same service/root cause within the cooldown window will suppress each other, dropping legitimate investigations for the later tenant. Include tenant in the dedup key (or namespace the key before calling allow) to prevent cross-tenant suppression.
Useful? React with 👍 / 👎.
|
Heads-up while reviewing this PR: the current branch tip doesn't compile in isolation. `internal/graphrag/clustering.go:87` calls `g.vectorIdx.Search(tenant, query, k*2)`, but `internal/vectordb/index.Search` is still the single-arg `(query, k)` signature on this branch. The follow-on vectordb tenant-partitioning changes (which appear to be the intended companion to this PR) seem to have been left out of the commit. Repro: ``` internal/graphrag/clustering.go:87:47: too many arguments in call to g.vectorIdx.Search``` I needed a buildable base to land RAN-38 (#28) on top of, so my local working tree carries the missing vectordb changes; my PR explicitly targets this branch and assumes RAN-37 lands first with those changes folded in. Flagging in case the omission was unintentional — happy to share the WIP I've been carrying if it helps unblock a quick fix here. |
The PR-27 build failed CI because internal/graphrag/clustering.go and internal/mcp/tools.go both called the 3-arg vectordb.Index.Search(tenant, query, k) signature, but that signature lives on RAN-20's vectordb tenant-isolation work and is not yet on main. The 3-arg form leaked in from a stacked branch during the original RAN-37 cut. Restore the 2-arg Search(query, k) call in both sites and leave a TODO(RAN-20) so the proper vector-side tenant scoping lands with the RAN-20 follow-up. RAN-37's in-memory tenant invariants are unaffected: SimilarErrors still narrows results by the per-tenant SignalStore's EMITTED_BY edges, so cross-tenant hits cannot surface even while the underlying vector index is shared. go build / vet / test ./... and -race on graphrag + api all green. Co-Authored-By: Paperclip <noreply@paperclip.ing>
|



Summary
Subtask A of the RAN-19 fix — in-memory and query surface only (persisted tables + merge-gate integration test remain with RAN-38 and RAN-39).
GraphRAGdrops the four direct-store fields and holdstenants map[string]*tenantStoresguarded by its ownRWMutex; slices are lazily materialised viastoresFor(ctx)/storesForTenant(tenant), with the default tenant seeded atNew()so background loops have a baseline.spanEvent/logEvent/metricEventeach carryTenant string;OnSpanIngested/OnLogIngested/OnMetricIngestedpopulate it from the existingstorage.Span.TenantID/storage.Log.TenantID/tsdb.RawMetric.TenantIDfields — callback signatures intentionally unchanged (TSDB and ingest wiring untouched).ctx context.Contextas the first argument and routes throughstoresFor(ctx). Two narrow helpers —AllServiceEdges(ctx)andAnomaliesForService(ctx, service, since)— let external handlers read store data without reaching into the composite.refresh,snapshot,anomaly) iterate a stable snapshot of tenants each tick, wrapctxwithstorage.WithTenantContext, and act per tenant.rebuildAllTenantsFromDBenumerates tenants from thespanstable so tenants without live callbacks are still recovered on startup.SignalStoremaps keep cluster nodes isolated.tools.go) and the APIhandleGetSystemGraphthreadctxinto queries; the 10s system-graph cache is tenant-keyed so cross-tenant cache hits are impossible.PersistInvestigationacceptstenantas its first arg so RAN-38 can wire the column without a signature break. Table schema itself is untouched.Acceptance
grep -i tenant internal/graphrag/hits every file that holds per-tenant maps or accepts queries;schema.go(types only) anddrain.go(shared log-shape miner) correctly do not.go build ./...,go vet ./..., andgo test ./...all green.map[...]*Nodewithout first going throughstoresFor(ctx)— verified by grep and the newTestGraphRAG_TenantIsolation_InMemoryStoresregression.Test plan
go build ./...go vet ./...go test ./...go test -race ./internal/graphrag/...TestGraphRAG_TenantIsolation_InMemoryStores— two tenants ingest overlapping data; each tenant'sServiceMap/DependencyChainreturns only its own services and traces.TestGraphRAG_StoresFor_EmptyContextCollapsesToDefault— empty-context lookups resolve toDefaultTenantID.Out of scope
investigations,graph_snapshots,drain_templates) — Subtask B (RAN-38) owns thetenant_idcolumn migration.internal/graph/legacy package — unchanged perCLAUDE.md.Refs RAN-19 RAN-21